ABSTRACT


AN ITERATIVE PROCEDURE FOR OBTAINING MAXIMUM-LIKELIHOOD 
ESTIMATES OF THE PARAMETERS FOR A MIXTURE OF NORMAL DISTRIBUTIONS 

This paper addresses the problem of numerically obtaining maximum-likelihood
estimates of the parameters for a mixture of normal distributions.

In recent literature, a certain successive-approximations procedure, based
on the likelihood equations, was shown empirically to be effective in
numerically approximating such maximum-likelihood estimates; however, the
reliability of this procedure was not established theoretically. Here, we
introduce a general iterative procedure, of the generalized steepest-ascent
(deflected-gradient) type, which is just the procedure known in the literature
when the step-size is taken to be 1. We show that, with probability 1 as the
sample size grows large, this procedure converges locally to the strongly
consistent maximum-likelihood estimate whenever the step-size lies between 0
and 2.

We also show that the step-size which yields optimal local convergence rates
for large samples is determined in a sense by the "separation" of the
component normal densities and is bounded below by a number between 1 and 2.



An Iterative Procedure for Obtaining 


Maximum-Likelihood Estimates of the Parameters 
for a Mixture of Normal Distributions 

by 

B. Charles Peters, Jr. 

NASA/National Research Council Research Associate
Earth Observations Division, Johnson Space Center

and 

*Homer F. Walker 

Department of Mathematics, University of Houston 
Houston, Texas 


1. Introduction.

Let x be an n-dimensional random variable whose density function p
is a convex combination of normal densities, i.e.,

\[ p(x) = \sum_{i=1}^{m} \alpha_i^0\, p_i(x) \quad \text{for } x \in \mathbb{R}^n, \]

where

\[ \alpha_i^0 > 0, \qquad \sum_{i=1}^{m} \alpha_i^0 = 1, \]

and

\[ p_i(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i^0|^{1/2}}\,
   e^{-\frac{1}{2}(x-\mu_i^0)^T (\Sigma_i^0)^{-1} (x-\mu_i^0)}. \]


If $\{x_k\}_{k=1,\dots,N}$ is an independent sample of observations on x, then
a maximum-likelihood estimate of the parameters
$\{\alpha_i^0, \mu_i^0, \Sigma_i^0\}_{i=1,\dots,m}$ is a choice of parameters
$\{\alpha_i, \mu_i, \Sigma_i\}_{i=1,\dots,m}$ which locally maximizes the
log-likelihood function

\[ L = \sum_{k=1}^{N} \log p(x_k), \]

in which p is evaluated with the true parameters
$\{\alpha_i^0, \mu_i^0, \Sigma_i^0\}_{i=1,\dots,m}$ replaced by the estimate
$\{\alpha_i, \mu_i, \Sigma_i\}_{i=1,\dots,m}$. (In the following, it is
usually clear from the context which parameters are used in evaluating the
density functions $p_i$ and $p$. Therefore, these parameters are explicitly
pointed out only when some ambiguity exists.) We admit local maxima of L
as maximum-likelihood estimates in order to avoid difficulties presented by
the fact that L has no global maximum. It is observed below that this
creates no problems when one is concerned with consistent maximum-likelihood
estimates.

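Clearly, L is a differentiable function of the parameters to be estimated.
Equating to zero the partial derivatives of L with respect to these parameters,
one obtains, after a straightforward calculation, the following necessary
condition for a maximum-likelihood estimate:

\[ \alpha_i = \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}, \tag{1.a} \]

\[ \mu_i = \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} x_k \frac{p_i(x_k)}{p(x_k)} \Bigr\}
   \Big/ \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \Bigr\}, \tag{1.b} \]

\[ \Sigma_i = \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} (x_k-\mu_i)(x_k-\mu_i)^T \frac{p_i(x_k)}{p(x_k)} \Bigr\}
   \Big/ \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \Bigr\}, \tag{1.c} \]

for $i = 1,\dots,m$.
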

These are known as the likelihood equations. A number of authors have
investigated solutions of the likelihood equations and consistency of
maximum-likelihood estimates. (See, for example, Cramer [2], Huzurbazar [7],
Wald [12], Chanda [1], and the discussion in Zacks [14].) We observe that,
loosely speaking, there is a unique solution of the likelihood equations which
tends with probability 1 to the true parameters as the sample size N
approaches infinity. Furthermore, this solution is a maximum-likelihood
estimate, indeed, the unique strongly consistent maximum-likelihood estimate.
(Strictly speaking, given any sufficiently small neighborhood of the true
parameters, there is, with probability 1 as N approaches infinity, a
unique solution of the likelihood equations in that neighborhood, and this
solution is a maximum-likelihood estimate. For completeness, we present a
brief proof of this result in Appendix 1.) This note is addressed to the
problem of determining this strongly consistent maximum-likelihood estimate
by successive approximations.

The likelihood equations, as written, suggest the following iterative
procedure for obtaining a solution: Beginning with some set of starting
values, obtain successive approximations to a solution by inserting the
preceding approximations in the expressions on the right-hand sides of
(1.a), (1.b), and (1.c). This scheme is attractive for its relative
ease of implementation, and we discuss below the findings of several authors
concerning its use in obtaining maximum-likelihood estimates. For a
discussion of other methods of determining maximum-likelihood estimates, see
Kale [8] and Wolfe [13] as well as the authors given below.
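
As an illustration only, one pass of this successive-approximations scheme can be sketched for the one-dimensional case (n = 1). The function and variable names below are hypothetical and do not come from the paper; variances play the role of the covariance matrices, and the preceding means are used on the right-hand side of (1.c), as the scheme prescribes.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def successive_approximation_step(xs, alphas, means, variances):
    """Insert the current parameters into the right-hand sides of (1.a)-(1.c).

    Hypothetical helper for n = 1; returns the next (alphas, means, variances).
    """
    n_obs = len(xs)
    m = len(alphas)
    # post[k][i] = alpha_i * p_i(x_k) / p(x_k)
    post = []
    for x in xs:
        weighted = [alphas[i] * normal_pdf(x, means[i], variances[i]) for i in range(m)]
        total = sum(weighted)
        post.append([w / total for w in weighted])

    new_alphas, new_means, new_vars = [], [], []
    for i in range(m):
        mass = sum(post[k][i] for k in range(n_obs))
        new_alphas.append(mass / n_obs)                                   # (1.a)
        new_means.append(sum(post[k][i] * xs[k] for k in range(n_obs)) / mass)  # (1.b)
        # (1.c) uses the preceding mean means[i], not the freshly updated one.
        new_vars.append(
            sum(post[k][i] * (xs[k] - means[i]) ** 2 for k in range(n_obs)) / mass
        )
    return new_alphas, new_means, new_vars
```

Repeated calls, each fed the output of the previous one, give the successive approximations; nothing in this sketch guards against the singular solutions discussed in the text.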

Empirical studies of Day [3], Duda and Hart [4], and Hasselblad [5]
suggest that this scheme is convergent and that convergence is particularly
fast when the component normal densities in p are "widely separated" in
a certain sense. Unfortunately, the likelihood equations have many solutions
in general, and the iterates may converge to solutions, including "singular
solutions" (see [4]), which are not the strongly consistent maximum-likelihood
estimate if care is not taken in the choice of starting values. No theoretical
evidence of convergence is given in [3], [4], or [5]. Peters and Coberly
[10] have proved that, if all of the parameters $\mu_i$ and $\Sigma_i$ are held
fixed, then the iterative procedure suggested by the equation (1.a) alone
converges locally to a maximum-likelihood estimate of the parameters
$\alpha_i$, $i = 1,\dots,m$. (An iterative procedure is said to converge locally
to a limit if the iterates converge to that limit whenever the starting values
are sufficiently near that limit.) They also report on numerical studies
in which the computational feasibility of this procedure is demonstrated.

In the following, we present a general iterative procedure for
determining the strongly consistent maximum-likelihood estimate, of which
the above procedure is a special case. Indeed, our procedure is a generalized
steepest-ascent (deflected-gradient) method, and the above procedure is
obtained when the step-size is taken to be 1. We show that, with probability
1 as the sample size grows large, this procedure converges locally to the
strongly consistent maximum-likelihood estimate whenever the step-size is
between 0 and 2. Furthermore, the value of the step-size which yields
optimal local convergence rates is bounded from below by a number which always
lies between 1 and 2. In fact, this optimal step-size lies near 1 if the
component populations are "widely separated" in a certain sense and cannot be
much smaller than 2 if two or more of the component populations have nearly
identical means and covariance matrices. We also prove that, if the covariance
matrices are held fixed, then the restricted iterative procedure for the
parameters $\alpha_i$ and $\mu_i$ has these local convergence properties with
probability 1 whenever the sample size is at least m(n+1). We conclude by
comparing this procedure to other numerical methods for determining
maximum-likelihood estimates.

2. The general iterative procedure.

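In order to minimize notational difficulties, we introduce several vector
spaces and give useful representations of their elements. For each i,
$1 \le i \le m$, $\alpha_i$, $\mu_i$, and $\Sigma_i$ are elements of the vector
spaces $\mathbb{R}^1$, $\mathbb{R}^n$, and the set of all real, symmetric
$n \times n$ matrices, respectively. We denote by $\mathcal{A}$, $\mathcal{M}$,
and $\mathcal{S}$ the respective m-fold direct sums of these spaces with
themselves, and we represent elements of $\mathcal{A}$, $\mathcal{M}$, and
$\mathcal{S}$ as columns

\[ \alpha = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_m \end{pmatrix} \in \mathcal{A}, \qquad
   \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_m \end{pmatrix} \in \mathcal{M}, \qquad
   \Sigma = \begin{pmatrix} \Sigma_1 \\ \vdots \\ \Sigma_m \end{pmatrix} \in \mathcal{S}. \]
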

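It will be convenient to adopt the following notational equivalence for elements
of the direct sum $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$:

\[ \Theta = \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix}
   = (\alpha_1, \dots, \alpha_m, \mu_1, \dots, \mu_m, \Sigma_1, \dots, \Sigma_m)^T. \]
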

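If, for $i = 1,\dots,m$ and $\Theta \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$,
we denote

\[ A_i(\Theta) = \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}, \]

\[ M_i(\Theta) = \Bigl\{ \frac{\alpha_i}{N} \sum_{k=1}^{N} x_k \frac{p_i(x_k)}{p(x_k)} \Bigr\}
   \Big/ \Bigl\{ \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \Bigr\}, \]

\[ S_i(\Theta) = \Bigl\{ \frac{\alpha_i}{N} \sum_{k=1}^{N} (x_k-\mu_i)(x_k-\mu_i)^T \frac{p_i(x_k)}{p(x_k)} \Bigr\}
   \Big/ \Bigl\{ \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \Bigr\}, \]

then the likelihood equations can be written as

\[ \Theta = \begin{pmatrix} A(\Theta) \\ M(\Theta) \\ S(\Theta) \end{pmatrix}, \tag{2} \]

where

\[ A(\Theta) = \begin{pmatrix} A_1(\Theta) \\ \vdots \\ A_m(\Theta) \end{pmatrix}, \qquad
   M(\Theta) = \begin{pmatrix} M_1(\Theta) \\ \vdots \\ M_m(\Theta) \end{pmatrix}, \qquad
   S(\Theta) = \begin{pmatrix} S_1(\Theta) \\ \vdots \\ S_m(\Theta) \end{pmatrix}. \]

One can write (2) equivalently as

\[ \Theta = \Phi_\varepsilon(\Theta) = (1-\varepsilon)\,\Theta
   + \varepsilon \begin{pmatrix} A(\Theta) \\ M(\Theta) \\ S(\Theta) \end{pmatrix} \tag{3} \]

for any value of $\varepsilon$. Of course, (3) becomes (2) when $\varepsilon = 1$.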

The following iterative procedure is suggested by (3) for obtaining a
solution of the likelihood equations: Beginning with some starting value
$\Theta^{(1)}$, define successive iterates inductively by

\[ \Theta^{(k+1)} = \Phi_\varepsilon(\Theta^{(k)}) \tag{4} \]

for $k = 1,2,3,\dots$. This is the general iterative procedure with which this
note is concerned. Clearly, this procedure becomes the procedure given in the
introduction when $\varepsilon = 1$.
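
The relaxation in (3) and (4) does not depend on the particular form of A, M, and S, so it can be sketched for a generic map acting on a flat list of parameters. In the sketch below, `fixed_point_map` stands in for the stacked map (A, M, S), and `toy_map` is a purely hypothetical scalar example with fixed point 2 and derivative 0.5; neither name comes from the paper.

```python
from typing import Callable, List

Vector = List[float]


def relaxed_step(theta: Vector, fixed_point_map: Callable[[Vector], Vector], eps: float) -> Vector:
    """One application of Phi_eps: (1 - eps) * theta + eps * G(theta)."""
    image = fixed_point_map(theta)
    return [(1.0 - eps) * t + eps * g for t, g in zip(theta, image)]


def iterate(theta: Vector, fixed_point_map: Callable[[Vector], Vector], eps: float, n_steps: int) -> Vector:
    """Apply the relaxed step n_steps times, as in procedure (4)."""
    for _ in range(n_steps):
        theta = relaxed_step(theta, fixed_point_map, eps)
    return theta


def toy_map(theta: Vector) -> Vector:
    """Hypothetical stand-in for (A, M, S): fixed point 2.0, derivative 0.5."""
    return [0.5 * theta[0] + 1.0]
```

With `eps = 1.0` the step reduces to plain substitution into the map. For this toy map the error is multiplied by `1 - 0.5 * eps` at each step, so a step-size somewhat larger than 1 converges faster, which mirrors the role the step-size plays in the general procedure.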

In the next section, we show that if $0 < \varepsilon < 2$, then, with
probability 1 as N approaches infinity, this procedure converges locally to the
strongly consistent maximum-likelihood estimate. This is done by showing that,
with probability 1 as N approaches infinity, the operator $\Phi_\varepsilon$ is
locally contractive (in a suitable vector norm) near that estimate, provided
$0 < \varepsilon < 2$. In saying that $\Phi_\varepsilon$ is locally contractive
near a point $\Theta \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$, we
mean that there is a vector norm $\|\cdot\|$ on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ and a number $\lambda$,
$0 \le \lambda < 1$, such that

\[ \|\Phi_\varepsilon(\Theta') - \Theta\| \le \lambda\,\|\Theta' - \Theta\| \tag{5} \]

whenever $\Theta'$ lies sufficiently near $\Theta$.

3. The local contractibility and convergence results.

We now establish the following

THEOREM. With probability 1 as N approaches infinity, $\Phi_\varepsilon$ is a
locally contractive operator (in some norm on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$) near the strongly
consistent maximum-likelihood estimate whenever $0 < \varepsilon < 2$.

Our main result, given by the following corollary, is an immediate
consequence of this theorem.

COROLLARY. With probability 1 as N approaches infinity, the iterative
procedure (4) converges locally to the strongly consistent maximum-likelihood
estimate whenever $0 < \varepsilon < 2$.

Throughout the remainder of this paper, the symbol "$\nabla$" denotes the
Frechet derivative of a vector-valued function of a vector variable. When
ambiguity exists, the specific vector variable of differentiation appears as
a subscript of this symbol. For questions concerning the definition and
properties of Frechet derivatives, see Luenberger [10].

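Proof of the theorem: Let

\[ \Theta = \begin{pmatrix} \alpha \\ \mu \\ \Sigma \end{pmatrix} \]

be the strongly consistent maximum-likelihood estimate. We assume that
$\alpha_i \neq 0$, $i = 1,\dots,m$. (As N tends to infinity, the probability is
1 that this is the case.) It must be shown that, with probability 1 as N
approaches infinity, an inequality of the form (5) holds whenever
$\varepsilon$ is a sufficiently small positive number.

For any norm on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$, one can write

\[ \Phi_\varepsilon(\Theta') - \Theta = \nabla\Phi_\varepsilon(\Theta)[\Theta' - \Theta]
   + O(\|\Theta' - \Theta\|^2). \]

Consequently, the theorem will be proved if it can be shown that, for
$0 < \varepsilon < 2$, $\nabla\Phi_\varepsilon(\Theta)$ converges with
probability 1 to an operator which has operator norm less than 1 with respect
to a suitable vector norm on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$.

One can write $\nabla\Phi_\varepsilon$ as $(1-\varepsilon)I$ plus a matrix of
Frechet derivatives:

\[ \nabla\Phi_\varepsilon = (1-\varepsilon)\,I + \varepsilon
   \begin{pmatrix}
     \nabla_\alpha A & \nabla_\mu A & \nabla_\Sigma A \\
     \nabla_\alpha M & \nabla_\mu M & \nabla_\Sigma M \\
     \nabla_\alpha S & \nabla_\mu S & \nabla_\Sigma S
   \end{pmatrix}. \]

This is consistent with our representation of elements of
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ as columns.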

The entries of the above matrix can themselves be represented as matrices
of Frechet derivatives. For $i = 1,\dots,m$, we introduce inner products
$\langle x, y \rangle'_i = x^T (\alpha_i \Sigma_i^{-1}) y$ on $\mathbb{R}^n$ and
$\langle A, B \rangle''_i = \operatorname{tr}\{A (\tfrac{\alpha_i}{2} \Sigma_i^{-1}) B^T\}$
on the space of real, symmetric $n \times n$ matrices, and we define

\[ \beta_i(x) = \frac{p_i(x)}{p(x)}, \qquad \gamma_i(x) = (x - \mu_i), \qquad
   \delta_i(x) = [\Sigma_i^{-1}(x-\mu_i)(x-\mu_i)^T - I]. \]

After a straightforward but extremely tedious calculation, one obtains
with the aid of equations (1) that
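\[ \nabla_\alpha A(\Theta) = I - (\operatorname{diag} \alpha_i) \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k) \\ \vdots \\ \beta_m(x_k) \end{pmatrix}
   \begin{pmatrix} \beta_1(x_k) & \cdots & \beta_m(x_k) \end{pmatrix} \Bigr\}, \]

\[ \nabla_\mu A(\Theta) = -(\operatorname{diag} \alpha_i) \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k) \\ \vdots \\ \beta_m(x_k) \end{pmatrix}
   \begin{pmatrix} \langle \beta_1(x_k)\gamma_1(x_k), \cdot \rangle'_1 & \cdots &
                   \langle \beta_m(x_k)\gamma_m(x_k), \cdot \rangle'_m \end{pmatrix} \Bigr\}, \]

\[ \nabla_\Sigma A(\Theta) = -(\operatorname{diag} \alpha_i) \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k) \\ \vdots \\ \beta_m(x_k) \end{pmatrix}
   \begin{pmatrix} \langle \beta_1(x_k)\delta_1(x_k), \cdot \rangle''_1 & \cdots &
                   \langle \beta_m(x_k)\delta_m(x_k), \cdot \rangle''_m \end{pmatrix} \Bigr\}, \]

\[ \nabla_\alpha M(\Theta) = - \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k)\gamma_1(x_k) \\ \vdots \\ \beta_m(x_k)\gamma_m(x_k) \end{pmatrix}
   \begin{pmatrix} \beta_1(x_k) & \cdots & \beta_m(x_k) \end{pmatrix} \Bigr\}, \]

\[ \nabla_\mu M(\Theta) = I - \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k)\gamma_1(x_k) \\ \vdots \\ \beta_m(x_k)\gamma_m(x_k) \end{pmatrix}
   \begin{pmatrix} \langle \beta_1(x_k)\gamma_1(x_k), \cdot \rangle'_1 & \cdots &
                   \langle \beta_m(x_k)\gamma_m(x_k), \cdot \rangle'_m \end{pmatrix} \Bigr\}, \]

\[ \nabla_\Sigma M(\Theta) = \Bigl( \operatorname{diag} \frac{1}{\alpha_i N} \sum_{k=1}^{N}
   \beta_i(x_k)\gamma_i(x_k) \langle \delta_i(x_k), \cdot \rangle''_i \Bigr)
   - \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k)\gamma_1(x_k) \\ \vdots \\ \beta_m(x_k)\gamma_m(x_k) \end{pmatrix}
   \begin{pmatrix} \langle \beta_1(x_k)\delta_1(x_k), \cdot \rangle''_1 & \cdots &
                   \langle \beta_m(x_k)\delta_m(x_k), \cdot \rangle''_m \end{pmatrix} \Bigr\}, \]

\[ \nabla_\alpha S(\Theta) = -(\operatorname{diag} \Sigma_i) \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k)\delta_1(x_k) \\ \vdots \\ \beta_m(x_k)\delta_m(x_k) \end{pmatrix}
   \begin{pmatrix} \beta_1(x_k) & \cdots & \beta_m(x_k) \end{pmatrix} \Bigr\}, \]

\[ \nabla_\mu S(\Theta) = \Bigl( \operatorname{diag} \Sigma_i \frac{1}{\alpha_i N} \sum_{k=1}^{N}
   \beta_i(x_k)\delta_i(x_k) \langle \gamma_i(x_k), \cdot \rangle'_i \Bigr)
   - (\operatorname{diag} \Sigma_i) \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k)\delta_1(x_k) \\ \vdots \\ \beta_m(x_k)\delta_m(x_k) \end{pmatrix}
   \begin{pmatrix} \langle \beta_1(x_k)\gamma_1(x_k), \cdot \rangle'_1 & \cdots &
                   \langle \beta_m(x_k)\gamma_m(x_k), \cdot \rangle'_m \end{pmatrix} \Bigr\}, \]

\[ \nabla_\Sigma S(\Theta) = \Bigl( \operatorname{diag} \Sigma_i \frac{1}{\alpha_i N} \sum_{k=1}^{N}
   \beta_i(x_k)\delta_i(x_k) \langle \delta_i(x_k), \cdot \rangle''_i \Bigr)
   - (\operatorname{diag} \Sigma_i) \Bigl\{ \frac{1}{N} \sum_{k=1}^{N}
   \begin{pmatrix} \beta_1(x_k)\delta_1(x_k) \\ \vdots \\ \beta_m(x_k)\delta_m(x_k) \end{pmatrix}
   \begin{pmatrix} \langle \beta_1(x_k)\delta_1(x_k), \cdot \rangle''_1 & \cdots &
                   \langle \beta_m(x_k)\delta_m(x_k), \cdot \rangle''_m \end{pmatrix} \Bigr\}. \]
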
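The inner products $\langle \cdot, \cdot \rangle'_i$ and
$\langle \cdot, \cdot \rangle''_i$ together with scalar multiplication on
$\mathbb{R}^1$ induce an inner product $\langle \cdot, \cdot \rangle$ on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$. Setting

\[ V(x) = \bigl( \beta_1(x), \dots, \beta_m(x),\;
   \beta_1(x)\gamma_1(x), \dots, \beta_m(x)\gamma_m(x),\;
   \beta_1(x)\delta_1(x), \dots, \beta_m(x)\delta_m(x) \bigr)^T
   \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}, \]

\[ B_{23} = \Bigl( \operatorname{diag} \frac{1}{\alpha_i N} \sum_{k=1}^{N}
   \beta_i(x_k)\gamma_i(x_k) \langle \delta_i(x_k), \cdot \rangle''_i \Bigr), \]

\[ B_{32} = \Bigl( \operatorname{diag} \Sigma_i \frac{1}{\alpha_i N} \sum_{k=1}^{N}
   \beta_i(x_k)\delta_i(x_k) \langle \gamma_i(x_k), \cdot \rangle'_i \Bigr), \]

\[ B_{33} = \Bigl( \operatorname{diag} \Sigma_i \frac{1}{\alpha_i N} \sum_{k=1}^{N}
   \beta_i(x_k)\delta_i(x_k) \langle \delta_i(x_k), \cdot \rangle''_i \Bigr), \]
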


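one obtains

\[ \nabla\Phi_\varepsilon(\Theta) =
   \begin{pmatrix}
     I & 0 & 0 \\
     0 & I & \varepsilon B_{23} \\
     0 & \varepsilon B_{32} & (1-\varepsilon) I + \varepsilon B_{33}
   \end{pmatrix}
   - \varepsilon
   \begin{pmatrix}
     (\operatorname{diag} \alpha_i) & 0 & 0 \\
     0 & I & 0 \\
     0 & 0 & (\operatorname{diag} \Sigma_i)
   \end{pmatrix}
   \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} V(x_k) \langle V(x_k), \cdot \rangle \Bigr\}. \]
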


Denoting the vector of true parameters by $\Theta^0$, one verifies without
difficulty that $\nabla\Phi_\varepsilon$ is of the form

\[ \nabla\Phi_\varepsilon(\Theta) = \frac{1}{N} \sum_{k=1}^{N} F(x_k, \Theta), \]

where the operator $F(x,\Theta)$ not only has finite expectation (in norm) at
$\Theta^0$ but also has a Frechet derivative with respect to $\Theta$ for which
the following holds: If $\|\cdot\|$ is any operator norm on $\nabla_\Theta F$,
then there exists a real-valued function f on $\mathbb{R}^n$ such that

\[ \int_{\mathbb{R}^n} f(x)\,p(x)\,dx < \infty \]

at $\Theta^0$ and such that $\|\nabla_\Theta F(x,\Theta)\| \le f(x)$ for all
$x \in \mathbb{R}^n$ and all $\Theta$ in a sufficiently small neighborhood of
$\Theta^0$. Since the solution $\Theta$ of the likelihood equations is strongly
consistent, it follows from the Strong Law of Large Numbers (see Loeve [9])
that $\nabla\Phi_\varepsilon(\Theta)$ converges with probability 1 to
$E(\nabla\Phi_\varepsilon(\Theta^0))$ as N approaches infinity.

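To complete the proof of the theorem, it must be shown that
$E(\nabla\Phi_\varepsilon(\Theta^0))$ has operator norm less than 1 with
respect to some vector norm on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ whenever
$0 < \varepsilon < 2$. A straightforward calculation yields

\[ E(\nabla\Phi_\varepsilon(\Theta^0)) =
   \begin{pmatrix} I & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix}
   - \varepsilon
   \begin{pmatrix}
     (\operatorname{diag} \alpha_i^0) & 0 & 0 \\
     0 & I & 0 \\
     0 & 0 & (\operatorname{diag} \Sigma_i^0)
   \end{pmatrix}
   \Bigl\{ \int_{\mathbb{R}^n} V(x) \langle V(x), \cdot \rangle\, p(x)\,dx \Bigr\}, \]

where $\alpha_i^0$, $\mu_i^0$, and $\Sigma_i^0$, $i = 1,\dots,m$, are the
components of $\Theta^0$. Thus $E(\nabla\Phi_\varepsilon(\Theta^0))$ is an
operator on $\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$ of the form
$I - \varepsilon QR$, where

\[ Q = \begin{pmatrix}
     (\operatorname{diag} \alpha_i^0) & 0 & 0 \\
     0 & I & 0 \\
     0 & 0 & (\operatorname{diag} \Sigma_i^0)
   \end{pmatrix}
   \qquad \text{and} \qquad
   R = \int_{\mathbb{R}^n} V(x) \langle V(x), \cdot \rangle\, p(x)\,dx \]

are positive-definite and symmetric with respect to the inner product
$\langle \cdot, \cdot \rangle$. Since QR is positive-definite and symmetric
with respect to the inner product $\langle \cdot, Q^{-1} \cdot \rangle$ on
$\mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$, it suffices to show that

\[ \langle W, RW \rangle = \langle W, Q^{-1}[QR]W \rangle \le \langle W, Q^{-1} W \rangle \]

for all $W \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$. Indeed, it
follows from this inequality that, with respect to the inner product
$\langle \cdot, Q^{-1} \cdot \rangle$, the operator norm of QR is no greater
than 1 and, hence, the operator norm of $E(\nabla\Phi_\varepsilon(\Theta^0))$
is less than 1 whenever $0 < \varepsilon < 2$.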
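
The step from this inequality to the norm bound can be spelled out as a short worked argument. The following is a sketch only; the symbol $\kappa$ for an eigenvalue of QR is introduced here for illustration and does not appear in the paper.

```latex
% Sketch: why 0 < eps < 2 suffices once 0 < <W,RW> <= <W,Q^{-1}W> for W != 0.
% kappa denotes an eigenvalue of QR (illustrative symbol, not from the text).
\begin{align*}
QRW = \kappa W,\ W \neq 0
  &\;\Longrightarrow\;
  \kappa = \frac{\langle W, RW \rangle}{\langle W, Q^{-1} W \rangle} \in (0, 1], \\
(I - \varepsilon QR)\,W &= (1 - \varepsilon\kappa)\,W, \\
0 < \varepsilon < 2
  &\;\Longrightarrow\;
  -1 < 1 - \varepsilon \le 1 - \varepsilon\kappa < 1 .
\end{align*}
```

Because $I - \varepsilon QR$ is symmetric with respect to $\langle \cdot, Q^{-1} \cdot \rangle$ on a finite-dimensional space, its operator norm in that inner product is the largest value of $|1 - \varepsilon\kappa|$ over the eigenvalues, which is therefore less than 1.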

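For

\[ W = (y_1, \dots, y_m,\; v_1, \dots, v_m,\; B_1, \dots, B_m)^T
   \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}, \]

one has

\[ \langle W, RW \rangle = \int_{\mathbb{R}^n} \Bigl[ \sum_{i=1}^{m} \Bigl( y_i \beta_i(x)
   + v_i^T (\alpha_i^0 (\Sigma_i^0)^{-1}) \beta_i(x)\gamma_i(x)
   + \operatorname{tr}\{ B_i (\tfrac{\alpha_i^0}{2} (\Sigma_i^0)^{-1}) \beta_i(x)\delta_i(x)^T \} \Bigr) \Bigr]^2 p(x)\,dx \]

\[ = \int_{\mathbb{R}^n} \Bigl[ \sum_{i=1}^{m} \Bigl( (\alpha_i^0)^{-1} y_i
   + v_i^T (\Sigma_i^0)^{-1} \gamma_i(x)
   + \operatorname{tr}\{ B_i (\tfrac{1}{2} (\Sigma_i^0)^{-1}) \delta_i(x)^T \} \Bigr)
   \alpha_i^0 \beta_i(x) \Bigr]^2 p(x)\,dx \]

\[ \le \int_{\mathbb{R}^n} \Bigl[ \sum_{i=1}^{m} \Bigl( (\alpha_i^0)^{-1} y_i
   + v_i^T (\Sigma_i^0)^{-1} \gamma_i(x)
   + \operatorname{tr}\{ B_i (\tfrac{1}{2} (\Sigma_i^0)^{-1}) \delta_i(x)^T \} \Bigr)^2
   \alpha_i^0 \beta_i(x) \Bigr] p(x)\,dx. \]

The inequality is a consequence of the following corollary of Schwarz's
inequality: If $\eta_i \ge 0$ for $i = 1,\dots,m$ and if
$\sum_{i=1}^{m} \eta_i = 1$, then
$\bigl( \sum_{i=1}^{m} \eta_i \xi_i \bigr)^2 \le \sum_{i=1}^{m} \eta_i \xi_i^2$
for all $\{\xi_i\}_{i=1,\dots,m}$. If the squared expressions in the last sum
above are written out in full, one sees that the integrals of the cross terms
in these expressions vanish. Consequently

\[ \langle W, RW \rangle \le \sum_{i=1}^{m} \int_{\mathbb{R}^n} \Bigl[ (\alpha_i^0)^{-2} y_i^2
   + (v_i^T (\Sigma_i^0)^{-1} \gamma_i(x))^2
   + (\operatorname{tr}\{ B_i (\tfrac{1}{2} (\Sigma_i^0)^{-1}) \delta_i(x)^T \})^2 \Bigr]
   \alpha_i^0 p_i(x)\,dx. \]
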

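Now

\[ \int_{\mathbb{R}^n} (\alpha_i^0)^{-2} y_i^2\, \alpha_i^0 p_i(x)\,dx = (\alpha_i^0)^{-1} y_i^2, \tag{6.a} \]

\[ \int_{\mathbb{R}^n} (v_i^T (\Sigma_i^0)^{-1} \gamma_i(x))^2 \alpha_i^0 p_i(x)\,dx
   = \int_{\mathbb{R}^n} v_i^T (\Sigma_i^0)^{-1} (x-\mu_i^0)(x-\mu_i^0)^T (\Sigma_i^0)^{-1} v_i\,
     \alpha_i^0 p_i(x)\,dx
   = \langle v_i, v_i \rangle'_i, \tag{6.b} \]

\[ \int_{\mathbb{R}^n} (\operatorname{tr}\{ B_i (\tfrac{1}{2} (\Sigma_i^0)^{-1}) \delta_i(x)^T \})^2
   \alpha_i^0 p_i(x)\,dx = \langle B_i, (\Sigma_i^0)^{-1} B_i \rangle''_i. \tag{6.c} \]

(A proof of (6.c) follows below.) From (6.a), (6.b), and (6.c), one
concludes that

\[ \langle W, RW \rangle \le \sum_{i=1}^{m} (\alpha_i^0)^{-1} y_i^2
   + \sum_{i=1}^{m} \langle v_i, v_i \rangle'_i
   + \sum_{i=1}^{m} \langle B_i, (\Sigma_i^0)^{-1} B_i \rangle''_i
   = \langle W, Q^{-1} W \rangle. \]

This completes the proof of the theorem.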

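Proof of (6.c): Setting $y = (\Sigma_i^0)^{-1/2} (x - \mu_i^0)$ and

\[ C = \int_{\mathbb{R}^n} (\operatorname{tr}\{ B_i (\tfrac{1}{2} (\Sigma_i^0)^{-1}) \delta_i(x)^T \})^2
   \alpha_i^0 p_i(x)\,dx, \]

one verifies that

\[ C = \frac{\alpha_i^0}{4} \int_{\mathbb{R}^n} (\operatorname{tr}\{ B_i [ (\Sigma_i^0)^{-1/2} y y^T
   (\Sigma_i^0)^{-1/2} - (\Sigma_i^0)^{-1} ]^T \})^2 p_0(y)\,dy, \]

where $p_0 \sim N(0,I)$. Denoting
$(\Sigma_i^0)^{-1/2} B_i (\Sigma_i^0)^{-1/2} = D = (d_{jk})$, one then derives

\[ C = \frac{\alpha_i^0}{4} \int_{\mathbb{R}^n} (\operatorname{tr}\{ D [ y y^T - I ] \})^2 p_0(y)\,dy \]

\[ = \frac{\alpha_i^0}{4} \int_{\mathbb{R}^n} [ (\operatorname{tr}\{ D y y^T \})^2
   - 2 \operatorname{tr}\{D\} \operatorname{tr}\{ D y y^T \} + (\operatorname{tr}\{D\})^2 ] p_0(y)\,dy \]

\[ = \frac{\alpha_i^0}{4} \Bigl\{ \sum_{j,k,p,q} d_{jk} d_{pq} \int_{\mathbb{R}^n} y_j y_k y_p y_q\, p_0(y)\,dy
   - 2 (\operatorname{tr}\{D\})^2 + (\operatorname{tr}\{D\})^2 \Bigr\} \]

\[ = \frac{\alpha_i^0}{4} \Bigl\{ \sum_{k \neq p} d_{kk} d_{pp} + \sum_{j \neq k} d_{jk} d_{jk}
   + \sum_{j \neq k} d_{jk} d_{kj} + 3 \sum_{k} d_{kk}^2 - (\operatorname{tr}\{D\})^2 \Bigr\} \]

\[ = \frac{\alpha_i^0}{2} \operatorname{tr}\{D^2\}
   = \frac{\alpha_i^0}{2} \operatorname{tr}\{ (\Sigma_i^0)^{-1/2} B_i (\Sigma_i^0)^{-1} B_i (\Sigma_i^0)^{-1/2} \}
   = \langle B_i, (\Sigma_i^0)^{-1} B_i \rangle''_i. \]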
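
The Gaussian fourth-moment step in this proof can be checked exactly for a small symmetric matrix. The sketch below assumes the standard identity for a standard normal vector, $E[y_j y_k y_p y_q] = \delta_{jk}\delta_{pq} + \delta_{jp}\delta_{kq} + \delta_{jq}\delta_{kp}$, and uses hypothetical helper names; it evaluates $E[(\operatorname{tr}\{D[yy^T - I]\})^2]$ term by term and compares it with $2\operatorname{tr}\{D^2\}$.

```python
def standard_normal_fourth_moment(j, k, p, q):
    """E[y_j y_k y_p y_q] for y ~ N(0, I), via the pairing identity."""
    def delta(a, b):
        return 1.0 if a == b else 0.0
    return delta(j, k) * delta(p, q) + delta(j, p) * delta(k, q) + delta(j, q) * delta(k, p)


def expected_squared_trace(D):
    """E[(tr{D (y y^T - I)})^2] for y ~ N(0, I), expanded as in the proof of (6.c)."""
    n = len(D)
    trace = sum(D[i][i] for i in range(n))
    quartic = sum(
        D[j][k] * D[p][q] * standard_normal_fourth_moment(j, k, p, q)
        for j in range(n) for k in range(n) for p in range(n) for q in range(n)
    )
    # E[tr{D y y^T}] = tr{D}, so the middle term contributes -2 (tr D)^2.
    return quartic - 2.0 * trace * trace + trace * trace


def trace_of_square(D):
    """tr{D^2} for a square matrix given as a list of rows."""
    n = len(D)
    return sum(D[j][k] * D[k][j] for j in range(n) for k in range(n))
```

For a symmetric D the two quantities agree up to the factor 2, which is the source of the factor $\alpha_i^0/2$ in the last line of the derivation.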
4. The optimal $\varepsilon$.

The results just obtained state that, with probability 1 as N approaches
infinity, the iterative procedure (4) converges locally to the strongly
consistent maximum-likelihood estimate $\Theta$ whenever $0 < \varepsilon < 2$.
In this section we observe that there exists a particular value of
$\varepsilon$, referred to as "the optimal $\varepsilon$," which yields, with
probability 1, the fastest asymptotic uniform rate of local convergence of (4)
near $\Theta$. We derive a lower bound between 1 and 2 on the optimal
$\varepsilon$ and relate it to the separation of the component
populations in the mixture.

From the proof of the theorem, one sees that the optimal $\varepsilon$ is that
which minimizes the spectral radius of the operator
$E(\nabla\Phi_\varepsilon(\Theta^0))$ restricted to the space
$\mathcal{E} \oplus \mathcal{M} \oplus \mathcal{S}$, where $\mathcal{E}$ is the
subspace of $\mathcal{A}$ whose components sum to zero. Indeed, the restricted
operator $E(\nabla\Phi_\varepsilon(\Theta^0)) = I - \varepsilon QR$ is
symmetric on $\mathcal{E} \oplus \mathcal{M} \oplus \mathcal{S}$ with respect
to the inner product $\langle \cdot, Q^{-1} \cdot \rangle$. Consequently, its
operator norm with respect to this inner product is equal to its spectral
radius and, hence, minimal. We observe that the restriction of QR to
$\mathcal{E} \oplus \mathcal{M} \oplus \mathcal{S}$ is positive-definite and
symmetric with respect to the inner product
$\langle \cdot, Q^{-1} \cdot \rangle$. Letting $\rho$ and $\tau$ denote,
respectively, the largest and smallest eigenvalues of this restriction of QR,
one verifies that the spectral radius of
$E(\nabla\Phi_\varepsilon(\Theta^0))$, restricted to
$\mathcal{E} \oplus \mathcal{M} \oplus \mathcal{S}$, is minimized when

$1 - \varepsilon\tau = \varepsilon\rho - 1$, i.e., when
$\varepsilon = \dfrac{2}{\rho + \tau}$.

It follows from the proof of the theorem that $\rho$ is never greater than
1. Thus the optimal $\varepsilon$ is bounded below by $2/(1+\tau)$, where
$\tau$ lies between 0 and 1. In particular, this lower bound on the optimal
$\varepsilon$ lies between 1 and 2. We have been unable to determine $\rho$
more precisely in general.

It should be noted that, if $\rho$ is strictly less than $1 - \tau$, then the
optimal $\varepsilon$ is actually greater than 2, even though the theorem just
proved fails to guarantee the local convergence of (4) for such values of
$\varepsilon$.
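
The trade-off described in this section can be tabulated numerically. The sketch below assumes only that the eigenvalues of the restricted QR lie in the interval $[\tau, \rho]$, so that the spectral radius of $I - \varepsilon QR$ is the larger of $|1 - \varepsilon\tau|$ and $|1 - \varepsilon\rho|$; the function names are hypothetical.

```python
def contraction_factor(eps, rho, tau):
    """Spectral radius of I - eps*QR when the extreme eigenvalues of QR are tau <= rho."""
    return max(abs(1.0 - eps * tau), abs(1.0 - eps * rho))


def optimal_epsilon(rho, tau):
    """Step-size at which 1 - eps*tau = eps*rho - 1, i.e. eps = 2 / (rho + tau)."""
    return 2.0 / (rho + tau)
```

For instance, with `rho = 1.0` and `tau = 0.5` the optimal step-size is 4/3 and the factor drops from 0.5 (at `eps = 1.0`) to 1/3, while values of `tau` near zero push the optimal step-size toward 2 and the factor toward 1.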

Suppose that the component populations in the mixture are "widely separated"
in the sense that each pair $(\mu_i^0, \Sigma_i^0)$ differs greatly from every
other such pair. Then, for $i,j = 1,\dots,m$,

\[ \frac{\alpha_i^0 p_i(x)}{p(x)} \cdot \frac{\alpha_j^0 p_j(x)}{p(x)} \approx 0
   \quad \text{for } x \in \mathbb{R}^n \text{ whenever } i \neq j. \]

One sees that $QR \approx I$ and, hence, $\rho$ and $\tau$ must both lie near 1.
Consequently, fastest asymptotic local convergence rates are obtained for
$\varepsilon$ near 1, and, for the optimal $\varepsilon$,
$E(\nabla\Phi_\varepsilon(\Theta^0)) = I - \varepsilon QR \approx 0$. Thus for
mixtures whose component populations are "widely separated," the optimal
$\varepsilon$ is only slightly greater than 1, and rapid first-order local
convergence of the iterative procedure (4) to $\Theta$ can be expected
asymptotically for this $\varepsilon$.

Now suppose that the component populations in the mixture are such that
at least two pairs $(\mu_i^0, \Sigma_i^0)$ and $(\mu_j^0, \Sigma_j^0)$,
$i \neq j$, are nearly identical. Then $\beta_i(x) \approx \beta_j(x)$,
$\beta_i(x)\gamma_i(x) \approx \beta_j(x)\gamma_j(x)$ and
$\beta_i(x)\delta_i(x) \approx \beta_j(x)\delta_j(x)$, and it follows that R
is nearly singular and, hence, that $\tau$ is near zero. One concludes that
the optimal $\varepsilon$ cannot be much smaller than 2. In fact, if $\rho$ is
near 1, as is the case when all pairs $(\mu_i^0, \Sigma_i^0)$ are nearly
identical, then the optimal $\varepsilon$ must lie near 2. Furthermore, the
spectral radius of $E(\nabla\Phi_\varepsilon(\Theta^0))$ is near 1, even for
the optimal $\varepsilon$; therefore, slow first-order convergence can be
expected asymptotically in this case.



5. Maximum-likelihood estimates of the a priori probabilities and the means.

It happens that, if the covariance matrices $\Sigma_i$, $i = 1,\dots,m$, are
held fixed, then, under certain conditions, an appropriately restricted version
of the iterative procedure (4) converges locally with probability 1 to a
maximum-likelihood estimate of the parameters $\alpha_i^0$ and $\mu_i^0$,
$i = 1,\dots,m$, whenever the number of observations in the sample reaches a
certain finite size. To be more specific, we introduce the following notation:
For $\bar\Theta = \binom{\alpha}{\mu} \in \mathcal{A} \oplus \mathcal{M}$ and
$\Sigma \in \mathcal{S}$, denote
$\Theta = (\alpha, \mu, \Sigma)^T \in \mathcal{A} \oplus \mathcal{M} \oplus \mathcal{S}$
by $(\bar\Theta, \Sigma)$. Then, for given $\Sigma$, the likelihood equations
for the parameters $\alpha_i$ and $\mu_i$, $i = 1,\dots,m$, can be written as

\[ \bar\Theta = \begin{pmatrix} A(\bar\Theta, \Sigma) \\ M(\bar\Theta, \Sigma) \end{pmatrix} \]

or, equivalently, as

\[ \bar\Theta = \bar\Phi_\varepsilon(\bar\Theta, \Sigma) = (1-\varepsilon)\,\bar\Theta
   + \varepsilon \begin{pmatrix} A(\bar\Theta, \Sigma) \\ M(\bar\Theta, \Sigma) \end{pmatrix} \tag{7} \]

for any $\varepsilon$. The appropriate iterative procedure to consider is the
following: Beginning with some starting value $\bar\Theta^{(1)}$, define
successive iterates inductively by

\[ \bar\Theta^{(k+1)} = \bar\Phi_\varepsilon(\bar\Theta^{(k)}, \Sigma) \tag{8} \]

for $k = 1,2,3,\dots$. Our result concerning this procedure is given by the
theorem and its corollary below.

THEOREM: If $N \ge m(n+1)$ and if $(\bar\Theta, \Sigma)$ is a solution of (7)
which lies sufficiently near a solution of (3), then, with probability 1,
$\bar\Phi_\varepsilon$ is a locally contractive operator (in some norm on
$\mathcal{A} \oplus \mathcal{M}$) near $\bar\Theta$ whenever
$0 < \varepsilon < 2$.

COROLLARY: If $N \ge m(n+1)$ and if $(\bar\Theta, \Sigma)$ is a solution of (7)
which lies sufficiently near a solution of (3), then, with probability 1, the
iterative procedure (8) converges locally to $\bar\Theta$ whenever
$0 < \varepsilon < 2$.

Proof of the Theorem: Suppose that $N \ge m(n+1)$ and $0 < \varepsilon < 2$.
As in the proof of the preceding theorem, it suffices to show that, with
probability 1, $\nabla_{\bar\Theta}\bar\Phi_\varepsilon(\bar\Theta, \Sigma)$ has
operator norm less than 1 with respect to some vector norm on
$\mathcal{A} \oplus \mathcal{M}$. Since
$\nabla_{\bar\Theta}\bar\Phi_\varepsilon$ depends continuously on $\bar\Theta$
and $\Sigma$, this need only be shown when $(\bar\Theta, \Sigma)$ is a solution
of (3).

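From the proof of the preceding theorem, one sees that if
$(\bar\Theta, \Sigma)$ is a solution of (3), then

\[ \nabla_{\bar\Theta}\bar\Phi_\varepsilon(\bar\Theta, \Sigma) =
   \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix}
   - \varepsilon \begin{pmatrix} (\operatorname{diag} \alpha_i) & 0 \\ 0 & I \end{pmatrix}
   \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} V(x_k) \langle V(x_k), \cdot \rangle \Bigr\}, \]

where

\[ V(x) = \bigl( \beta_1(x), \dots, \beta_m(x),\;
   \beta_1(x)\gamma_1(x), \dots, \beta_m(x)\gamma_m(x) \bigr)^T \]

and the inner product $\langle \cdot, \cdot \rangle$ is now the inner product
induced on $\mathcal{A} \oplus \mathcal{M}$ by scalar multiplication on
$\mathbb{R}^1$ and the inner products $\langle \cdot, \cdot \rangle'_i$ on
$\mathbb{R}^n$. As before,
$\nabla_{\bar\Theta}\bar\Phi_\varepsilon(\bar\Theta, \Sigma)$ is of the form
$I - \varepsilon QR$, where

\[ Q = \begin{pmatrix} (\operatorname{diag} \alpha_i) & 0 \\ 0 & I \end{pmatrix}
   \qquad \text{and} \qquad
   R = \frac{1}{N} \sum_{k=1}^{N} V(x_k) \langle V(x_k), \cdot \rangle. \]

We observe that QR is symmetric and positive semi-definite with respect to
the inner product $\langle \cdot, Q^{-1} \cdot \rangle$. In fact, it is shown
in Appendix 2 that, with probability 1, QR is positive-definite with respect
to this inner product. Consequently, the theorem will be proved if it can be
shown that

\[ \langle W, RW \rangle = \langle W, Q^{-1}[QR]W \rangle \le \langle W, Q^{-1} W \rangle \]

for all $W \in \mathcal{A} \oplus \mathcal{M}$. For
$W = (y_1, \dots, y_m, v_1, \dots, v_m)^T$, one has

\[ \langle W, RW \rangle = \frac{1}{N} \sum_{k=1}^{N} \Bigl[ \sum_{i=1}^{m}
   \bigl( \alpha_i^{-1} y_i + v_i^T \Sigma_i^{-1} \gamma_i(x_k) \bigr) \alpha_i \beta_i(x_k) \Bigr]^2
   \le \frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{m}
   \bigl( \alpha_i^{-1} y_i + v_i^T \Sigma_i^{-1} \gamma_i(x_k) \bigr)^2 \alpha_i \beta_i(x_k) \]

by Schwarz's inequality. Since $(\bar\Theta, \Sigma)$ is a solution of (3),
this easily yields

\[ \langle W, RW \rangle \le \sum_{i=1}^{m} \alpha_i^{-1} y_i^2
   + \sum_{i=1}^{m} v_i^T (\alpha_i \Sigma_i^{-1}) v_i = \langle W, Q^{-1} W \rangle, \]

and the proof is complete.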

If the conclusion of the theorem holds for some solution
$(\bar\Theta, \Sigma)$ of (7), then, as in the preceding section, a particular
value of $\varepsilon$ can be determined which yields the fastest uniform rate
of local convergence of (8) near $\bar\Theta$. With respect to the inner
product $\langle \cdot, Q^{-1} \cdot \rangle$, QR is positive-definite and
symmetric on $\mathcal{A} \oplus \mathcal{M}$. Denoting the largest and
smallest eigenvalues of the restriction of QR to
$\mathcal{E} \oplus \mathcal{M}$ by $\rho$ and $\tau$, respectively, one sees
that the optimal $\varepsilon$ is again given by
$\varepsilon = 2/(\rho + \tau)$. Since the restriction of QR has operator norm
no greater than 1 with respect to the inner product
$\langle \cdot, Q^{-1} \cdot \rangle$, $\rho$ must be no greater than 1. Hence,
$\varepsilon \ge 2/(1+\tau)$, where $\tau$ lies between 0 and 1. Reasoning as
before, one sees that the optimal $\varepsilon$ lies near 1 if the component
populations are "widely separated," and cannot be much less than 2 if two or
more of the populations have nearly identical means and covariance matrices.
In the former case, rapid first-order local convergence of (8) can be expected
for the optimal $\varepsilon$. In the latter case, if $\rho$ is near 1, then
the optimal $\varepsilon$ must be near 2, and slow first-order convergence of
(8) can be expected, even for the optimal $\varepsilon$.


6. Concluding remarks.

A number of numerical techniques for obtaining maximum-likelihood estimates
of the parameters for a mixture of normal distributions have been discussed in
the literature. In addition to the usual steepest-ascent method for obtaining a
local maximum of the log-likelihood function, we mention in particular Newton's
method, the method of scoring, and the modifications of these procedures
investigated by Kale [8] for obtaining solutions of the likelihood equations.
It is our feeling that the iterative procedure (4) offers considerable
computational advantages over these procedures in many cases of practical
interest.

Even though the partial derivatives of the log-likelihood function are not
appreciably more difficult to evaluate than the expressions used in defining
the function $\Phi_\varepsilon$, the procedure (4), which is a generalized
steepest-ascent (deflected-gradient) method, appears to have two particular
advantages over the usual steepest-ascent method. First, the major practical
implication of this note is that the iterative procedure (4) converges whenever
the step-size $\varepsilon$ lies in an interval which is completely independent
of the particular mixture problem at hand. It is readily ascertained that this
cannot be said for the regular steepest-ascent procedure


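\[ \alpha_i^{(q+1)} = \alpha_i^{(q)} + \varepsilon \Bigl[ \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}
   - \frac{1}{mN} \sum_{j=1}^{m} \sum_{k=1}^{N} \frac{p_j(x_k)}{p(x_k)} \Bigr], \]

\[ \mu_i^{(q+1)} = \mu_i^{(q)} + \varepsilon \Bigl[ \frac{1}{N} \sum_{k=1}^{N}
   \frac{\alpha_i^{(q)} p_i(x_k)}{p(x_k)}\, \Sigma_i^{(q)-1} (x_k - \mu_i^{(q)}) \Bigr], \]

\[ \Sigma_i^{(q+1)} = \Sigma_i^{(q)} + \varepsilon \Bigl[ \frac{1}{2N} \sum_{k=1}^{N}
   \frac{\alpha_i^{(q)} p_i(x_k)}{p(x_k)} \bigl\{ -\Sigma_i^{(q)-1}
   + \Sigma_i^{(q)-1} (x_k - \mu_i^{(q)})(x_k - \mu_i^{(q)})^T \Sigma_i^{(q)-1} \bigr\} \Bigr]. \]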


Second, if $\varepsilon$ is no greater than 1, the successive iterates defined
by (4) automatically satisfy the requisite constraints on the parameters, i.e.,
the successive $\Sigma_i$'s are, with probability 1 for large N,
positive-definite and the successive $\alpha_i$'s are positive and sum to 1.
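
For the mixing proportions, this remark reduces to a statement about convex combinations, which the following sketch illustrates. Here `updated` stands for the vector A(Θ), which is positive and sums to 1 whenever the current proportions are positive; the function name and the numbers in the usage note are hypothetical.

```python
def relax_weights(alpha, updated, eps):
    """Proportion block of Phi_eps: (1 - eps) * alpha + eps * updated."""
    return [(1.0 - eps) * a + eps * u for a, u in zip(alpha, updated)]
```

With `alpha = [0.9, 0.1]` and `updated = [0.3, 0.7]`, the step-size 0.7 gives proportions that are again positive, whereas the step-size 1.9 gives a negative first proportion. In both cases the entries still sum to 1, so it is positivity, not the sum constraint, that can fail for step-sizes above 1.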

Although Newton's method and the method of scoring offer quadratic and
near-quadratic convergence, respectively, for large sample sizes, they require
at each iteration the inversion of a square matrix whose dimension is equal to
the number of independent variables among the parameters, namely
$m(n+1)(n+2)/2 - 1$. Thus these methods may be less efficient computationally
than the iterative procedure (4) if m and n are large, even though they may
yield a satisfactory approximate solution after fewer iterations. The modified
versions of Newton's method and the method of scoring do not require the
re-calculation of the inverse of a large matrix at each step. However,
quadratic convergence is not achieved with these modified methods, and
multiplication by a large matrix must still be carried out at each iteration.
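
To give a sense of the matrix sizes involved, the number of independent parameters can be counted directly: m - 1 free mixing proportions, m mean vectors of length n, and m symmetric n-by-n covariance matrices. The helper below is a hypothetical illustration of that count, not part of any procedure in the paper.

```python
def free_parameter_count(m, n):
    """Independent parameters in an m-component mixture of n-dimensional normals."""
    proportions = m - 1                    # the alphas sum to 1
    means = m * n
    covariances = m * n * (n + 1) // 2     # symmetric n x n matrices
    return proportions + means + covariances
```

For example, `free_parameter_count(3, 2)` is 17, and `free_parameter_count(5, 10)` is 329, which is the order of the square matrix a Newton-type method would have to invert at each step in that case.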
Appendix 1.

We now give a brief proof of the existence and uniqueness of the strongly
consistent maximum-likelihood estimate. For the sake of generality, this is
done in a somewhat broader context than is necessary for this paper.

Let $p(x,\Theta)$ be a probability density function of a vector variable
$x \in \mathbb{R}^n$ and a vector parameter $\Theta \in \mathbb{R}^\nu$. If
$\{x_k\}_{k=1,\dots,N}$ is an independent sample of observations on a random
variable $x \in \mathbb{R}^n$ whose probability density function is
$p(x,\Theta^0)$ for some $\Theta^0 \in \mathbb{R}^\nu$, then a
maximum-likelihood estimate of $\Theta^0$ is a choice of $\Theta$ which locally
maximizes the log-likelihood function

\[ L = \sum_{k=1}^{N} \log p(x_k, \Theta). \]

If p is a differentiable function of $\Theta$, then a necessary condition for a
maximum-likelihood estimate is that the likelihood equations

\[ \frac{\partial L}{\partial \theta_i} = 0, \qquad i = 1,\dots,\nu, \]

be satisfied, where $\theta_i$ is the i-th component of $\Theta$. In the
following, our objective is to show that if p satisfies certain conditions,
then, given any sufficiently small neighborhood of $\Theta^0$, there is, with
probability 1 as N approaches infinity, a unique solution of the likelihood
equations in that neighborhood, and this solution is a maximum-likelihood
estimate of $\Theta^0$.

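We assume that $p(x,\Theta)$ satisfies the following conditions of Chanda [1]:

(a) There is a neighborhood $\Omega$ of $\Theta^0$ such that for all
$\Theta \in \Omega$, for almost all $x \in \mathbb{R}^n$, and for
$i,j,k = 1,\dots,\nu$,
$\dfrac{\partial p}{\partial \theta_i}$,
$\dfrac{\partial^2 p}{\partial \theta_i \partial \theta_j}$, and
$\dfrac{\partial^3 p}{\partial \theta_i \partial \theta_j \partial \theta_k}$
exist and satisfy

\[ \Bigl| \frac{\partial p}{\partial \theta_i} \Bigr| \le f_i(x), \qquad
   \Bigl| \frac{\partial^2 p}{\partial \theta_i \partial \theta_j} \Bigr| \le f_{ij}(x), \qquad
   \Bigl| \frac{\partial^3 \log p}{\partial \theta_i \partial \theta_j \partial \theta_k} \Bigr| \le f_{ijk}(x), \]

where $f_i$ and $f_{ij}$ are integrable and $f_{ijk}$ satisfies

\[ \int_{\mathbb{R}^n} f_{ijk}(x)\, p(x,\Theta^0)\,dx < \infty. \]

(b) The matrix

\[ J(\Theta) = \Bigl( \int_{\mathbb{R}^n} \frac{\partial \log p}{\partial \theta_i}
   \frac{\partial \log p}{\partial \theta_j}\, p\,dx \Bigr) \]

is positive-definite at $\Theta^0$.
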
Let

\[ \mathcal{L}(\Theta) = \frac{1}{N} \Bigl( \frac{\partial L}{\partial \theta_1}, \dots,
   \frac{\partial L}{\partial \theta_\nu} \Bigr)^T. \]

It is immediately seen that $\mathcal{L}(\Theta) = 0$ if and only if the
likelihood equations are satisfied, and that, by the Law of Large Numbers,
$\mathcal{L}(\Theta^0)$ converges with probability 1 to zero. Furthermore, it
follows from assumptions (a) and (b) above that there exists a neighborhood
$\Omega^0$ of $\Theta^0$ (contained in $\Omega$ and, for convenience, convex)
and a positive $\varepsilon$ such that, with probability 1 as N approaches
infinity, $\nabla\mathcal{L}(\Theta) \le -\varepsilon I$ for all
$\Theta \in \Omega^0$. (The inequality is with respect to the usual ordering on
symmetric matrices.) Denoting the spherical neighborhood of radius $\delta$
about $\Theta^0$ by $\Omega_\delta$, we establish the following

Lemma: With probability 1 as N approaches infinity,

(i) $\mathcal{L}$ is one-to-one on $\Omega^0$,
(ii) $\mathcal{L}(\Omega_\delta)$ contains the ball of radius
$\varepsilon\delta$ about $\mathcal{L}(\Theta^0)$ whenever
$\Omega_\delta \subseteq \Omega^0$.

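Proof: We may assume that $\nabla\mathcal{L}(\Theta) \le -\varepsilon I$ for
all $\Theta \in \Omega^0$, since the probability that this is the case is 1 as
N approaches infinity. To prove (i), suppose that
$\mathcal{L}(\Theta^1) = \mathcal{L}(\Theta^2)$ for $\Theta^1$ and $\Theta^2$
in $\Omega^0$. Then

\[ 0 = (\Theta^1 - \Theta^2)^T [\mathcal{L}(\Theta^1) - \mathcal{L}(\Theta^2)]
   = (\Theta^1 - \Theta^2)^T \Bigl\{ \int_0^1 \nabla\mathcal{L}(\Theta^2 + t[\Theta^1 - \Theta^2])\,dt \Bigr\}
     (\Theta^1 - \Theta^2). \]

The negative-definiteness of $\nabla\mathcal{L}$ implies that
$\Theta^1 = \Theta^2$, and (i) is proved.

To prove (ii), suppose that $\Omega_\delta \subseteq \Omega^0$, and let
$\Theta^1$ be a boundary point of $\Omega_\delta$. Then

\[ \mathcal{L}(\Theta^1) - \mathcal{L}(\Theta^0)
   = \Bigl\{ \int_0^1 \nabla\mathcal{L}(\Theta^0 + t[\Theta^1 - \Theta^0])\,dt \Bigr\} (\Theta^1 - \Theta^0). \]

After left-multiplying this equation by $(\Theta^1 - \Theta^0)^T$, one verifies
using Schwarz's inequality and the negative-definiteness of
$\nabla\mathcal{L}$ that

\[ \|\mathcal{L}(\Theta^1) - \mathcal{L}(\Theta^0)\| \ge \varepsilon\,\|\Theta^1 - \Theta^0\| = \varepsilon\delta, \]

where $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^\nu$. Since
all boundary points of $\mathcal{L}(\Omega_\delta)$ are images under
$\mathcal{L}$ of boundary points of $\Omega_\delta$, the proof of (ii) is
complete.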

The desired result of this appendix follows immediately from this lemma and
the remarks preceding it. Indeed, if $\Omega^*$ is any neighborhood of
$\Theta^0$ which is contained in $\Omega^0$, then one can find a $\delta$ for
which $\Omega_\delta \subseteq \Omega^* \subseteq \Omega^0$. By the lemma,
the probability is 1 as N tends to infinity that $\mathcal{L}$ is one-to-one on
$\Omega^*$ and that $\mathcal{L}(\Omega_\delta)$ and, hence,
$\mathcal{L}(\Omega^*)$ contain the ball of radius $\varepsilon\delta$ about
$\mathcal{L}(\Theta^0)$. Since $\mathcal{L}(\Theta^0)$ converges with
probability 1 to zero, one concludes that, with probability 1 as N approaches
infinity, there exists a unique $\Theta \in \Omega^*$ for which
$\mathcal{L}(\Theta) = 0$. Since the probability also is 1 as N approaches
infinity that $\nabla\mathcal{L}$ is negative-definite on $\Omega^*$, this
$\Theta$ is, with probability 1, a maximum-likelihood estimate.
Appendix 2.

We now prove that the operator QR is positive-definite on
$\mathcal{A} \oplus \mathcal{M}$ with probability 1 whenever $N \ge m(n+1)$.
Since

\[ QR = \begin{pmatrix} (\operatorname{diag} \alpha_i) & 0 \\ 0 & I \end{pmatrix}
   \Bigl\{ \frac{1}{N} \sum_{k=1}^{N} V(x_k) \langle V(x_k), \cdot \rangle \Bigr\}, \]

it suffices to show that the vectors

\[ V(x_k) = \Bigl( \frac{p_1(x_k)}{p(x_k)}, \dots, \frac{p_m(x_k)}{p(x_k)},\;
   \frac{p_1(x_k)}{p(x_k)} (x_k - \mu_1), \dots, \frac{p_m(x_k)}{p(x_k)} (x_k - \mu_m) \Bigr)^T,
   \qquad k = 1,\dots,N, \]

span $\mathcal{A} \oplus \mathcal{M}$ with probability 1 whenever
$N \ge m(n+1)$. This follows from the more general result below.

Lemma. Let $\{x_k\}_{k=1,\dots,N}$ be an independent sample of observations on
a random variable x in $\mathbb{R}^s$ which is distributed with a probability
density function p. If V is a real-analytic function from $\mathbb{R}^s$ to
$\mathbb{R}^t$ whose component functions are linearly independent, then the
vectors $V(x_k)$, $k = 1,\dots,N$, span $\mathbb{R}^t$ with probability 1
whenever $N \ge t$.


Proof: Denoting the j-th component function of V by $v_j$, we define a
real-analytic function $V_j$ from $\mathbb{R}^s$ to $\mathbb{R}^j$ by

\[ V_j(x) = \begin{pmatrix} v_1(x) \\ \vdots \\ v_j(x) \end{pmatrix} \]

for $j = 1,\dots,t$. Our proof of the lemma consists of showing inductively
that, for $j = 1,\dots,t$, the set $\{V_j(x_k)\}_{k=1,\dots,j}$ spans
$\mathbb{R}^j$ with probability 1. We make the preliminary observation that,
since the real-analytic functions $v_j$ are assumed to be linearly independent,
any non-zero linear combination of them vanishes only on a set of Lebesgue
measure zero in $\mathbb{R}^s$.

From the observation above, $V_1(x_1)$ is non-zero with probability 1;
hence $V_1(x_1)$ spans $\mathbb{R}^1$ with probability 1. Suppose now that, for
some j, $1 \le j < t$, the set $\{V_j(x_k)\}_{k=1,\dots,j}$ spans
$\mathbb{R}^j$ with probability 1. Then, with probability 1, the set
$\{V_{j+1}(x_k)\}_{k=1,\dots,j+1}$ fails to span $\mathbb{R}^{j+1}$ if and only
if

\[ V_{j+1}(x_{j+1}) = \sum_{k=1}^{j} c_k V_{j+1}(x_k) \tag{9} \]

for some set of constants $\{c_k\}_{k=1,\dots,j}$. If (9) holds, the constants
$c_k$ are determined by

\[ \begin{pmatrix} c_1 \\ \vdots \\ c_j \end{pmatrix} = \mathcal{V}_j^{-1} V_j(x_{j+1}) \]

with probability 1, where $\mathcal{V}_j$ is the $j \times j$ matrix whose k-th
column is $V_j(x_k)$. Thus, with probability 1, (9) holds if and only if

\[ v_{j+1}(x_{j+1}) = \begin{pmatrix} v_{j+1}(x_1) & \cdots & v_{j+1}(x_j) \end{pmatrix}
   \mathcal{V}_j^{-1} V_j(x_{j+1}). \]

Now

\[ \begin{pmatrix} v_{j+1}(x_1) & \cdots & v_{j+1}(x_j) \end{pmatrix}
   \mathcal{V}_j^{-1} V_j(x) - v_{j+1}(x) \]

is a non-zero linear combination of the functions $v_1, \dots, v_{j+1}$ and,
hence, vanishes only on a set of Lebesgue measure zero in $\mathbb{R}^s$. One
concludes that $\{V_{j+1}(x_k)\}_{k=1,\dots,j+1}$ fails to span
$\mathbb{R}^{j+1}$ with probability zero. This completes the induction, and the
lemma is proved.
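
The lemma can be illustrated on a small hypothetical example that is not taken from the paper: the component functions 1, x, and x^2 are real-analytic and linearly independent on the real line, so the vectors V(x_k) = (1, x_k, x_k^2) built from three distinct sample points should span three-dimensional space. The sketch below checks this by computing a rank with plain Gaussian elimination; the function names and the tolerance are assumptions made for illustration.

```python
def rank(rows, tol=1e-12):
    """Rank of a matrix (list of rows) by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in rows]
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(a[i][c]))
        if abs(a[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, n_rows):
            factor = a[i][c] / a[r][c]
            for j in range(c, n_cols):
                a[i][j] -= factor * a[r][j]
        r += 1
    return r


def monomial_vector(x):
    """Hypothetical V(x) with linearly independent analytic components 1, x, x^2."""
    return [1.0, x, x * x]
```

Three distinct points give rank 3, while repeating a point lowers the rank; for a sample drawn from a density, repeated points occur with probability zero, which is the situation the lemma describes.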